(57) ABSTRACT

A technique for temporal smoothing of results from a scene 
analysis process which creates a sequence of visually pleas- 
ing and acceptable images generated in whole or part from 
such results. The technique applies temporal smoothing 
across time-related sets of scene analysis results, and can be
combined with spatial smoothing, which can be applied at various steps in the process: to
images in the original sequence, to intermediate or final 
results of scene analysis in either a pixel-oriented or geo- 
metric domain, or to the images generated in whole or part 
from the scene analysis results. In a preferred embodiment, 
different levels of smoothing are applied to different parts of 
the intermediate or final results. The differentiation can be 
done using image masks (for pixel-oriented results) or 
geometry selection techniques (for geometric results). This 
allows a higher level of smoothing in certain areas where the 
generated visual images will benefit from such additional 
smoothing, while avoiding over-smoothing in other areas 
where the smoothing would obscure critical details within 
the generated images. 
TEMPORAL SMOOTHING OF SCENE
ANALYSIS DATA FOR IMAGE SEQUENCE 
GENERATION 

BACKGROUND 

Scene analysis is the process of using computer-based 
systems to derive estimated information about a visual scene 
from a series of images of that same visual scene. One 
application of these techniques is in media production — the 
process of creating media content for use in films, videos, 
broadcast television, television commercials, interactive 
games, CD-ROM titles, DVD titles, Internet or intranet web 
sites, and related or derived formats. These techniques can 
be applied in the pre-production, production and post- 
production phases of the overall media production process. 
Other design visualization application areas include indus- 
trial design and architecture. Scene analysis is also applied 
in areas such as surveillance, reconnaissance, medicine, and 
the creation of simulation environments for entertainment, 
training, education, and marketing. 

By recovering the estimated scene structure, it is possible 
to treat the elements of a visual scene as abstract
three-dimensional geometric and/or volumetric objects that can be
processed, manipulated and combined with other data 
objects. There are multiple techniques to estimate the param- 
eters of a visual scene from a series of two-dimensional 
images of that scene. Various aspects of the visual scene, 
including both two-dimensional and three-dimensional 
information, can be estimated using techniques such as: 

estimating extrinsic camera parameters (such as camera 
path and camera orientation); 

estimating intrinsic camera parameters (such as focal 
length and optical center); 

estimating depths in the scene for pixels in the images 
(depth maps); 

estimating geometry and/or materials properties of 
objects and their surfaces in the scene (estimating scene 
structure); 

estimating areas of motion in the scene (motion 
detection); and 

estimating paths of feature points and/or objects in the 
scene (2D and 3D feature tracking). 

Underlying these techniques are algorithms that can be 
categorized into a few general classes. Methods based on 
dense optical flow attempt to recover the optical flow vectors 
across pairs of images, typically on a per-pixel basis. Meth- 
ods based on feature tracking typically select visual features 
in one image (based on criteria such as areas of high 
contrast) and attempt to track the path of each selected 
feature across a series of related images. Most feature 
tracking methods track a relatively sparse array of features, 
ranging from a single feature to a few hundred. 

For example, Horn, B. K. P. and Schunck, B. G., in 
"Determining Optical Flow," Artificial Intelligence, Vol. 17, 
pp. 185-203 (1981) describe how so-called optical flow 
techniques may be used to detect velocities of brightness 
patterns in an image stream to segment the image frames 
into pixel regions corresponding to particular visual objects. 

Becker, S. and Bove, V. M., in "Semiautomatic 3D Model 
Extraction from Uncalibrated 2-D Camera Views," Proceedings
SPIE Visual Data Exploration and Analysis II, vol.
2410, pp. 447-461 (1995) describe a technique for extract- 
ing a three-dimensional (3-D) scene model from two- 
dimensional (2-D) pixel-based image representations as a set 
of 3-D mathematical abstract representations of visual 
objects in the scene as well as camera parameters and depth 
maps. 
Sawhney, H. S., in "3D Geometry from Planar Parallax", IEEE 1063-6919/94 (1994), pp. 929-934 discusses a technique for deriving 3-D structure through perspective projection using motion parallax defined with respect to an arbitrary dominant plane.

Poelman, C. J., et al., in "A Paraperspective Factorization Method for Shape and Motion Recovery", Dec. 11, 1993, Carnegie Mellon University Report (CMU-CS-93-219), elaborates on a factorization method for recovering both the shape of an object and its motion from a sequence of images, using many images and tracking many feature points.

A new method, the dense feature array method, tracks a dense array of features across a series of images. The dense feature array method can track feature densities up to a per-pixel level, and is described in a co-pending United States patent application by Moorby, P. R., entitled "Feature Tracking Using a Dense Feature Array", serial number 09/054,866 filed on Apr. 3, 1998 and assigned to SynaPix, Inc., the assignee of the present application.

However, a key barrier to widespread adoption of these techniques is the inherent uncertainty and inaccuracy of any estimation technique. The information encoded in the images about the actual elements in the visual scene and their relationships is incomplete, and typically distorted by artifacts of the imaging process. As a result, the utility of automated scene analysis drops dramatically as the estimates become unreliable and unpredictable wherever the information encoded in the images is noisy, incomplete or simply missing.

For example, inter-object occlusions, shadows, specular highlights and imaging artifacts such as noise and distortion can all interact in the pixel values of the images. When some or all of these factors are introduced, the probability increases that the scene analysis process will produce results that are not meaningful, unstable or inaccurate for some areas of the scene.

An error in a single result is often propagated within the analysis process. Errors are propagated when results for local regions of the image are correlated, interpolated or used in fitting global parameters of the scene. For example, parallax methods of deriving depths of objects in the scene perform a global fit of scene parameters based on a set of local parameters computed across the images being matched.
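The propagation mechanism can be seen in a toy example. The sketch below is an illustration under our own assumptions, not a method from the text: an ordinary least-squares line stands in for the global fit of scene parameters, and the five local estimates are made-up numbers. Corrupting only the last local estimate changes the fitted global model, and with it the value assigned to the first region, whose own estimate was correct.

```python
# Illustration only: a least-squares line stands in for a global scene fit;
# the positions and local estimates are made-up numbers, not patent data.


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


xs = [0, 1, 2, 3, 4]                 # five local regions
clean = [1.0, 2.0, 3.0, 4.0, 5.0]    # local estimates consistent with y = 1 + x
corrupted = clean[:4] + [9.0]        # one bad local estimate in the last region

a_clean, b_clean = fit_line(xs, clean)
a_bad, b_bad = fit_line(xs, corrupted)

# The error in region 4 alone changes the global model, so the value the
# model assigns to region 0 (whose own estimate was correct) changes too.
print(a_clean + b_clean * 0, a_bad + b_bad * 0)
```

The same effect is what makes smoothing of intermediate results worthwhile: one local error does not stay local once it feeds a global fit.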

Other examples of potential error propagation can be found in feature tracking algorithms such as described in Tomasi, C., et al., "Shape and Motion from Image Streams: a Factorization Method", Technical Report CMU-CS-91-172, Carnegie Mellon University (1991), or in a scene structure recovery method based on feature tracking as in the above-referenced Poelman publication. In these types of scene analysis methods, errors are propagated when a match between local features is used to either estimate or constrain matches in neighboring local regions, or estimate or constrain matches across images.

For certain applications, the time-based sequence of input images is analyzed in order to develop a continuous sequence of interpreted images as an output result. Depending on the technique in use, these results can be visually interpreted and presented as pixel values in images associated with the original source images, or as geometric objects (such as sets of points, curves or meshes) related to the original source images.
When such visually interpreted results are themselves presented in a temporal sequence of images generated in whole or part from such results, the errors and artifacts in the results appear as anomalies in the generated images. Since the errors and uncertainties in the results can change and shift in the visual sequence, the visual anomalies in the generated images will appear to occur in either random patterns ("dancing" around in the image sequence) or discernible patterns ("rolling" through the image sequence). These visual anomalies can be confusing, visually displeasing, distracting and/or annoying to the viewer.

In some applications, such as military intelligence systems, the trained professionals using such systems are expected to tolerate some level of visual anomalies. However, in design visualization applications, the eventual viewer of the images generated from the interpreted results is expected to have a relatively low tolerance for visual anomalies.

For example, in the production (including the post-production phase) of media content, every effort must be made to avoid, correct or otherwise rectify such visual anomalies. The same is true in most design visualization systems. If such visual anomalies are difficult to filter or smooth automatically (both within each generated image and across a temporal sequence of such generated images), painstaking, labor-intensive adjustments are required in order to achieve acceptable quality.

Consider also that the producer of multimedia content typically wishes to use one or more scene models in the first place to create as accurate a representation of the scene as possible. For example, consider a motion picture environment where computer-generated special effects are to appear in a scene with real world objects and actors. The content creator may choose to start by creating models that represent various static and/or dynamic aspects of the scene. Some of these models can be derived from digitized motion picture film using automatic image-interpretation techniques. The creator can then proceed to combine computer-generated abstract elements with the elements derived from image-interpretation in a visually and aesthetically pleasing way.

Problems can occur with this approach, however, since automatic image-interpretation processes are statistical in nature, and the input image pixels are themselves the results of a sampling and filtering process. Consider that images are sampled from two-dimensional (2-D) projections (onto a camera's imaging plane) of three-dimensional (3-D) physical scenes. Not only does this sampling process introduce errors, but also the projection into the 2-D image plane of the camera limits the amount of 3-D information that can be recovered from these images. The 3-D characteristics of objects in the scene, 3-D movement of objects, and 3-D camera movements can typically only be partially quantified from sequences of images provided by cameras.

As a result, image-interpretation processes do not always automatically converge to the correct solution. For example, even though one might think it is relatively straightforward to derive a 3-D mathematical representation of a simple object such as a soda can from sequences of images of that soda can, a process for determining the location and size of a 3-D wire frame mesh needed to represent the soda can may not properly converge, depending upon the lighting, camera angles, and so on used in the original image capture. Because of the probabilistic nature of this type of model, the end result cannot be reliably predicted.

SUMMARY OF THE INVENTION

The invention is a method for temporal smoothing of results from a scene analysis process, such as to create, in real time, a sequence of visually pleasing and acceptable images generated in whole or part from such results. The invention is also a method of combining temporal smoothing with spatial smoothing, thus further improving the images generated in whole or part from such results.

The temporal smoothing is applied across time-related sets of scene analysis results. Temporal smoothing can be done across these results in either a pixel-oriented or geometric domain, depending on the scene analysis technique being used. The spatial smoothing can be applied at one or more steps in the process: to images in the original sequence, to intermediate or final results of scene analysis in either a pixel-oriented or geometric domain, or to the images generated in whole or part from the scene analysis results.

In a preferred embodiment, different levels of smoothing are applied to different parts of the intermediate or final results. The differentiation can be done using image masks (for pixel-oriented results) or geometry selection techniques (for geometric results). This allows a higher level of smoothing in certain areas where the generated visual images will benefit from such additional smoothing, while avoiding over-smoothing in other areas where the smoothing would obscure critical details within the generated images.

This differentiation can be held constant across the time-based sequence of scene analysis results, or be specified to vary over time. Furthermore, the specification process can optionally be controlled by user mark-up and interactive parameter adjustment in a method such as the one described in the United States patent application by Madden, P. B., et al., entitled "Computer Assisted Mark-up And Parameterization For Scene Analysis", filed Apr. 6, 1998, Ser. No. 09/056,022, and assigned to SynaPix, Inc., the assignee of the present application, which is hereby incorporated by reference.

The temporal smoothing method modifies or constrains the sequence of scene analysis results so that images generated from them appear [...] from image to image. This is done through a combination of filtering and scaling adjustments applied to these results. Filtering removes temporal "spikes" (sharp variations in otherwise smoothly varying results), temporal "sparkles" (random variations in results) and temporal "ripples" (results that should have a constant slope when plotted against time, but instead exhibit a varying slope). Scaling adjustments prevent temporal "jumps" (discontinuities) in scene analysis parameters that can only be estimated to within a scaling factor.

While the temporal smoothing method (with optional spatial smoothing) can be applied to results from a wide range of scene analysis techniques, it is particularly effective in smoothing time-oriented results from techniques for estimating three-dimensional parameters of a scene. This includes methods for estimating three-dimensional camera path and orientation in a scene, methods for estimating the depths of object surfaces in a scene, or methods for estimating the geometric structure of such surfaces.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.

FIG. 1 is a block diagram of an image processing system which develops a scene model according to the invention.
FIG. 2 is an illustration of various functional elements and data structures used in the scene model.

FIG. 3 is a flow diagram of an image analysis process in which temporal smoothing and/or filtering are applied according to the invention.

FIG. 4 is an original image of a kitchen scene.

FIG. 5 is a 3-D plot of raw gamma values produced by an image analysis process for an image pair captured from the kitchen scene.

FIG. 6 illustrates gamma values after being smoothed according to the invention.

FIGS. 7-1, 7-2, and 7-3 are a sequence of raw height estimates taken from depth map results plotted relative to a plane such as the table in the original image sequence.

FIGS. 8-1, 8-2, and 8-3 illustrate the height estimates after smoothing.

DETAILED DESCRIPTION OF THE INVENTION

1. Introduction

Turning attention now in particular to the drawings, FIG. 1 is a block diagram of the components of a digital image processing system 10 which applies temporal smoothing techniques according to the invention. The system 10 includes a computer workstation 20, a computer monitor 21, and input devices such as a keyboard 22, mouse 23 and tablet 23A. The workstation 20 also includes input/output interfaces 24, storage 25, such as a disk 26 and random access memory 27, as well as one or more processors 28. The workstation 20 may be a computer graphics workstation such as the O2/Octane sold by Silicon Graphics, Inc., a Windows NT-type workstation, or other suitable computer or computers. The computer monitor 21, keyboard 22, mouse 23, tablet 23A, and other input devices are used to interact with various software elements of the system existing in the workstation 20 to cause programs to be run and data to be stored as described below.

The system 10 also includes a number of other hardware elements typical of an image processing system, such as a video monitor 30, audio monitors 31, hardware accelerator 32, and user input devices 33. Also included are image capture devices, such as a video cassette recorder (VCR), video tape recorder (VTR), and/or digital disk recorder 34 (DDR), cameras 35, and/or film scanner/telecine 36. Digitized camera parameters 35A may be captured to provide data and information concerning type, position, lens, focal length and other information about the cameras 35. Sensors 38 and manual data entry 38A may also provide information about the scene and image capture devices.

The present invention is intended for smoothing a continuous sequence of input images 39 at video playback rates. To accomplish this, software processes associated with the system 10 develop a modular scene model 40. This modular scene model 40 includes one or more of the original 2-D image sequence 50; captured parametric data and information 51; mark up data 49 related to the images 50; adjustments 47 to parametric data 51; depth maps 55 for the images 50; and surface meshes 56 for the images 50.

As shown in greater detail in FIG. 2, in a preferred embodiment, the modular scene model 40 is created and modified by software including an analysis function 42, parametric adjustment tools 46, and mark up tools 48. The analysis function 42 uses image processing algorithms, such as "machine vision" or "image understanding" algorithms as previously described, sometimes in conjunction with captured parametric data 39A, to extract and interpret information about the captured images 39.

The information extracted from the physical scene, as detected and/or estimated from the captured image(s), then becomes the basis for generating initial depth maps 55 and surface meshes 56 that characterize the scene. The initial depth maps 55 and surface meshes 56 may contain information not only derived from the captured image sources themselves, such as VTR/VCR/DDR 34 (camera 35 and film scanner/telecine 36), but also that derived from camera data capture 35A, manual data entry 38A and other secondary sensors 38. In addition, depth maps 55 and surface meshes 56 may also be generated by external computer systems such as graphics systems and other computer modeling systems.

Further refinement of the depth maps 55 and surface meshes 56 may be achieved in a number of different ways. In a first scenario, the results of the initial pass of the image analysis function 42 are represented as depth maps 55 and/or surface meshes 56, presenting the user of the system 10 with a rendition of the scene via the scene viewer 44 for comparison to the original 2-D images 50. The user then proceeds to refine the depth maps 55 and surface meshes 56. This can be done via the mark up tools 48, whereby the user provides information to the system identifying elements or regions in the image as geometric abstractions and/or pixel regions, such information being entered as mark up data 49 through the user interface 43. The analysis function 42 is then performed again utilizing this additional mark up data 49 combined with the 2-D images 50 to produce a modified set of depth maps 55 and/or surface meshes 56, which are subsequently displayed in the scene viewer 44.

Continuing to pay attention briefly to FIG. 2, in the initial pass, the analysis techniques 42, based strictly on the input image streams 39, produce results containing depth maps 55 and/or surface meshes 56 that estimate the relative depths and positions of pixels or pixel regions in the original 2-D images. This process may typically include estimating camera parameters 45 to provide depth estimates, such as those computed from image parallax and/or feature tracking between multiple images of the same scene, either successive images from the same camera, or images from two or more cameras. Data from other sensors, such as laser range-finders, can also be used in depth estimation. Such estimates and additional data can be used by the analysis techniques 42 in combination with the mark up data 49 to produce further refinements of the depth maps 55 and/or surface meshes 56.

In a second scenario, the results of the initial pass of the image analysis function 42 are represented as depth maps 55 and/or surface meshes 56 presented to the user via the scene viewer 44 as described earlier. The user then provides input in the form of parametric adjustments 47 performed within the parametric adjustment tools 46 portion of the user interface 43. Such inputs may include adjustments to parameters such as focal length, the physical distance between two points in the scene (referred to as a "ground truth"), camera position in time, camera shutter speed, and camera aperture settings for given frames of the captured image streams 39. The analysis function 42 is performed again utilizing this additional parametric information 47, combined either with the original 2-D images 50 or, in yet a third scenario, with depth maps 55 and/or surface meshes 56 resulting from prior usage of the mark up tools 48, to produce further refined depth maps 55 and/or surface meshes 56, which are subsequently displayed in the viewer 44.
In a third scenario, the user may elect to utilize the methods described above but with particular attention upon refining and improving camera parameter data as a combination of original parametric data 51 and parametric adjustments 47. In this scenario, the estimated camera parameter data (part of the camera parameters 45) is adjusted by the user via the parametric adjustment tools 46. The analysis function 42 is performed again and the resultant depth maps 55 and surface meshes 56 are presented to the user through the viewer 44. Through a combination of visual inspection and quantitative system feedback, the user can then iteratively adjust the camera parameter 45 values to produce a more acceptable result.

2. Temporal Smoothing of Scene Analysis Data for Image Sequence Generation

In accordance with the present invention, a time-based sequence of selected data from the scene analysis 42 process is first analyzed as a time-series. The data sets analyzed can represent inputs, such as the original image sequence 50 data; intermediate results, such as camera parameters 45, captured parametric data and information 51, mark up data 49 related to the images 50, or adjustments 47 to parametric data 51; or final results of the scene analysis 42, such as the depth maps 55 and/or surface meshes 56.

By representing the selected data as a time-series, corresponding data sets can then be compared across time-values. It is then possible to apply modifications of the data sets, and/or revised constraints on the scene analysis techniques, across the time-series. The modifications to the data sets are made with respect to their effect on a time-based image sequence 60 generated, in whole or in part, with the various results of scene analysis 42.

In addition, where the image sequence 60 generated by the scene viewer 44 is likely to display visually inconsistent or undesirable artifacts, appropriate filters or scaling factors are applied to modify the data and/or constrain portions of the associated scene analysis process 42. After these modifications are applied, the number of visually discernible artifacts perceptibly decreases in the generated image sequence 60, as will be illustrated later on.

The data sets used in this method, the time-series analysis performed, and the filtering and/or scaling factors applied depend on the specific scene analysis techniques and algorithms employed. The filtering or scaling factors also depend on the presence of optional pixel-oriented masks or geometry selections. These can selectively partition the scene analysis data or results, to apply a different filter or scaling to different parts of the data.

For example, to eliminate temporal "spikes" in the generated image sequence 60, standard curve-fitting techniques can be applied to a time series of corresponding data sets. A curve-fitting technique, with time as one axis, is an appropriate spike-removal technique for scene analysis results that normally vary smoothly over time. For example, camera path and orientation parameter 45 data typically have smoothly varying parameters, and spikes in these paths are most likely to be anomalies. Individual results that do not lie on the curve, or within a specified tolerance factor (such as a standard deviation) from the curve, are therefore constrained to fit on the curve. For fitting a time series of multi-dimensional data, surface-fitting techniques can be used.

Temporal "sparkles" are removed by applying a filtering technique, such as a box filter, Gaussian filter, or similar convolution filter, to each member of the selected time-series. Filtering each data set in the time-series reduces the number of random variations in the data sets, thus reducing the amount of sparkles perceived in a generated image sequence 60 resulting from this filtered data. This filtering can be applied at several different stages in the overall analysis 42 process.

Filtering the input image sequence 50 to the scene analysis 42 suppresses imaging artifacts, such as noise, in the original sequence of source images. However, by also filtering intermediate results within the analysis process 42, this smoothing is reflected in downstream steps of the process. For example, when a final result is inversely proportional to an intermediate result, then filtering out small temporal perturbation errors in the intermediate result prevents major temporal swings in the final result. Filtering for temporal sparkles can also be done on the final results of the scene analysis 42, or on the generated images 60.

Temporal "ripples" are controlled by filtering and curve-fitting (or surface-fitting) techniques. The temporal ripples reflect estimates that are varying around their true values. For example, a flat table in an image such as shown in FIG. 4 might be incorrectly estimated by the scene analysis 42 as a surface with bumps on it (possibly from visual artifacts like specular highlights on the table surface). At each time-value in a depth map 55 representing the table, the scene analysis 42 produces a different estimate of the table surface (and the bumps on the surface). When the generated sequence 60 is produced from this depth map 55, the bumps appear to undulate and move around the surface of the table.

Reducing such variations in each data set is a filtering problem, which can be addressed by any of the means described for temporal smoothing of sparkles. Further constraining the data sets to remove temporal ripples requires understanding the relationships between the data sets in the time series, however. In the table example, an estimate of how the table's surface moves over time would allow correlation of the data sets in the time-series. These correlated data sets can then be constrained by a curve-fitting or surface-fitting technique. The fitting process can either be done on each local data set, or fit across data sets using time as one of the coordinate axes in the curve-fitting algorithm.

Temporal "jumps" are smoothed by an aggregate function computed over each selected data set in the time-series. This can be a weighted or un-weighted aggregate function. Each selected data set must consist of data points with similar attributes that can be scaled as a group. For example, the estimated heights of object surfaces in a scene such as the table of FIG. 4 are scaled as a group and can be aggregated for each image or part of an image in a time-based sequence. The aggregate values are compared across data sets in the time-series, and a scaling factor adjustment is derived for each data set which smooths the aggregate values. This scaling factor adjustment can be derived either from the average of the aggregate values, or from a curve-fitting process applied to the aggregate values over time. The scaling adjustment factors are then applied to their respective data sets.

3. Application to a Representative Scene Analysis Technique

The application of the temporal smoothing method, with optional spatial smoothing, is now further described with respect to a representative scene analysis 42 technique. This technique derives a series of depth maps 55 and, optionally, surface meshes 56 from an input time-based image sequence 50. As shown in particular detail in FIG. 3, the scene analysis 42 technique includes processes for developing an image pyramid 72, computing gradients 74, identifying dominant planes 76, planar fit 78, optical flow 80, parallax fit 82, camera position and rotation 84, curve and surface fit 88, sequence filter 90, and gamma to depth conversion 92. Optionally, a depth to mesh conversion 94, pixel masks 96, and geometry selections 98 may be employed.
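The ripple removal described in Section 2, fitting correlated data sets with time as one of the coordinate axes, can be sketched per sample. This is a minimal sketch under stated assumptions, not the patented implementation: the samples are assumed to be already in correspondence across frames (for the table, via an estimate of how its surface moves), and each sample's track is assumed to follow a straight line in time, the constant-slope case named in the summary. A higher-order curve or a surface fit could replace the line.

```python
# Sketch only: samples are assumed to be in correspondence across frames,
# and a straight line in t is assumed to be the right model for each track.
from typing import List, Sequence, Tuple


def fit_line(ts: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares intercept and slope of y against t."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    stt = sum((t - mt) ** 2 for t in ts)
    sty = sum((t - mt) * (y - my) for t, y in zip(ts, ys))
    slope = sty / stt
    return my - slope * mt, slope


def remove_ripples(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """frames[t][i] is sample i at time t; each track is replaced by its fitted line."""
    ts = list(range(len(frames)))
    out = [list(map(float, row)) for row in frames]
    for i in range(len(frames[0])):
        intercept, slope = fit_line(ts, [frames[t][i] for t in ts])
        for t in ts:
            out[t][i] = intercept + slope * t
    return out


# Sample 0: a surface point receding at a steady rate, estimated with ripples.
# Sample 1: a static point whose estimate is already steady.
frames = [[0.0, 5.0], [1.2, 5.0], [1.8, 5.0], [3.2, 5.0], [3.8, 5.0]]
fitted = remove_ripples(frames)
```

After fitting, sample 0 advances by the same amount every frame, so the undulation disappears while the overall motion is kept.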
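Jump removal by aggregate scaling can be sketched in a few lines. Here the aggregate function is a mean (optionally weighted), and the scaling target is the average of all aggregates, one of the two options the text gives; the curve-fitting alternative would replace that average with a fitted value per frame. The function names and example heights are hypothetical, and every aggregate is assumed to be nonzero.

```python
# Sketch only: a (weighted) mean is the aggregate function and the target is
# the average of all aggregates; aggregates are assumed to be nonzero.
from typing import List, Optional, Sequence


def aggregate(data_set: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    """Un-weighted or weighted mean of one data set."""
    if weights is None:
        return sum(data_set) / len(data_set)
    return sum(w * x for w, x in zip(weights, data_set)) / sum(weights)


def remove_jumps(data_sets: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rescale each data set so its aggregate matches the mean aggregate."""
    aggregates = [aggregate(d) for d in data_sets]
    target = sum(aggregates) / len(aggregates)
    return [[x * (target / agg) for x in d] for d, agg in zip(data_sets, aggregates)]


# Surface heights per frame; frame 1 came out at twice the scale of its neighbors.
heights = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 2.0, 3.0]]
rescaled = remove_jumps(heights)
```

Because each data set is scaled as a group, the relative heights within a frame are preserved; only the frame-to-frame scale discontinuity is removed.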
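Differentiated smoothing with pixel masks 96 can be sketched as a temporal moving average whose window radius depends on a per-pixel mask. The boolean mask and the two radii are hypothetical parameters chosen for illustration: pixels marked True (for example, a flat table top) get a wider window, while unmarked pixels, where smoothing would obscure detail, keep their original values when the weak radius is 0.

```python
# Sketch only: the mask is a hypothetical per-pixel boolean (True = area that
# benefits from extra smoothing), mapped to a temporal window radius.
from typing import List, Sequence


def masked_temporal_smooth(frames: Sequence[Sequence[float]], mask: Sequence[bool],
                           strong_radius: int = 2, weak_radius: int = 0) -> List[List[float]]:
    """Temporal moving average whose window radius depends on each pixel's mask value."""
    n = len(frames)
    radii = [strong_radius if m else weak_radius for m in mask]
    out = []
    for t in range(n):
        row = []
        for i, r in enumerate(radii):
            lo, hi = max(0, t - r), min(n, t + r + 1)   # window clipped at sequence ends
            row.append(sum(frames[k][i] for k in range(lo, hi)) / (hi - lo))
        out.append(row)
    return out


# Pixel 0 lies on a flat table (masked for strong smoothing); pixel 1 is fine detail.
frames = [[1.0, 5.0], [4.0, 8.0], [1.0, 5.0]]
masked = masked_temporal_smooth(frames, mask=[True, False], strong_radius=1)
```

A geometry selection 98 would play the same role for geometric results, selecting which vertices or surfaces receive the stronger smoothing.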

The illustrated analysis function 42 follows a process
which is known in general as a "planar parallax" method.
First, the image pyramid process 72 develops an image 
pyramid wherein each image frame of the input image 
sequence 50 is represented as a hierarchical pyramid of 
successive frame representations at successively coarser
levels of detail, e.g., as a fewer number of pixels. Gradient 
computation 74 is then applied to each pixel at each level of 
the pyramid. Dominant plane identification 76 then deter- 
mines or estimates a dominant plane for each image in the 
sequence. For example, in a sequence of images such as 
shown in FIG. 4 of the kitchen, the dominant plane is 
typically the table. 
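The image pyramid process 72 and gradient computation 74 can be illustrated with a minimal sketch. It assumes grayscale frames stored as lists of lists, builds each coarser level by averaging 2x2 blocks (one common choice; the text does not specify the reduction filter), and uses central differences for the per-pixel gradient. The function names are hypothetical.

```python
def downsample(img):
    """Halve resolution by averaging each 2x2 block (odd edges dropped)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1] +
              img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def build_pyramid(img, levels):
    """Return [full-res, half-res, quarter-res, ...] representations."""
    pyramid = [img]
    for _ in range(levels - 1):
        if len(pyramid[-1]) < 2 or len(pyramid[-1][0]) < 2:
            break  # cannot reduce further
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def gradients(img):
    """Central-difference (gx, gy) at each pixel, clamping at borders."""
    h, w = len(img), len(img[0])

    def px(y, x):
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    gx = [[(px(y, x + 1) - px(y, x - 1)) / 2.0 for x in range(w)]
          for y in range(h)]
    gy = [[(px(y + 1, x) - px(y - 1, x)) / 2.0 for x in range(w)]
          for y in range(h)]
    return gx, gy


# A 4x4 horizontal ramp: intensity increases by 1 per column.
frame = [[float(x) for x in range(4)] for _ in range(4)]
pyr = build_pyramid(frame, levels=3)
print([(len(level), len(level[0])) for level in pyr])  # [(4, 4), (2, 2), (1, 1)]
gx, gy = gradients(frame)
print(gx[1][1], gy[1][1])  # 1.0 0.0
```

In the same spirit, gradients would be computed on every level of `pyr`, not only on the full-resolution frame.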

A planar fit 78 process then calculates an optical flow field,
or vectors, across each image pair, for example by using
the image pyramid. The optical flow process 80 takes the
resulting set of optical flow vectors, which characterize the
relative movement of objects and/or the camera in the scene,
warps the second image of each image pair in the sequence
using the optical flow field, and subtracts it from the first
image in each pair to obtain the residual parallax.
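The warp-and-subtract step that yields the residual parallax can be sketched as below. This is an illustration only, assuming a dense per-pixel flow field of (dx, dy) displacements that maps each pixel of the first image to its location in the second, nearest-neighbour sampling with border clamping, and plain Python lists. The function name is hypothetical, and a production system would more likely use sub-pixel (for example bilinear) interpolation.

```python
def residual_after_warp(first, second, flow):
    """Warp `second` back onto `first` using `flow` and subtract.

    flow[y][x] = (dx, dy): the pixel at (x, y) in `first` is assumed to
    appear at (x + dx, y + dy) in `second`. Intensity explained by the
    flow cancels; what remains approximates the residual motion.
    """
    h, w = len(first), len(first[0])
    residual = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            # Nearest-neighbour sample, clamped to the image border.
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            row.append(first[y][x] - second[sy][sx])
        residual.append(row)
    return residual


# The second frame is the first shifted right by one pixel; a flow of
# (1, 0) everywhere explains that motion, so the residual is zero except
# in the last column, where sampling hits the clamped border.
first = [[float(10 * y + x) for x in range(5)] for y in range(3)]
second = [[first[y][max(x - 1, 0)] for x in range(5)] for y in range(3)]
flow = [[(1.0, 0.0)] * 5 for _ in range(3)]
res = residual_after_warp(first, second, flow)
print(res[1])  # [0.0, 0.0, 0.0, 0.0, 1.0]
```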

The parallax fit process 82 then derives a "gamma" 
coefficient for each pixel in the first image of each pair and 
an estimated translation vector for the camera between 
images. The gamma coefficients are inversely proportional 
to the pixel depth from the camera. 
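Because the gamma coefficients are inversely proportional to depth, small errors in gamma matter most where gamma is near zero. The short derivation below makes this concrete under a simplifying assumption: depth is written as Z = k / gamma, with k a single per-frame constant standing in for the camera and plane parameters that the actual gamma-to-depth conversion uses.

```latex
Z = \frac{k}{\gamma}, \qquad
\frac{\partial Z}{\partial \gamma} = -\frac{k}{\gamma^{2}}, \qquad
\Delta Z \approx -\frac{k}{\gamma^{2}}\,\Delta\gamma
```

With k = 1, lowering gamma by 0.01 from 0.50 to 0.49 moves Z from 2.00 to about 2.04, whereas the same change from 0.02 to 0.01 doubles Z from 50 to 100.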

Coefficients from both the planar fit 78 and parallax fit 82 
are then used by the camera position and rotation estimation 
process 84, along with an estimated position of the dominant 
plane. 

After applying filtering 90 to the sequence of gamma
parameters provided by the parallax fit, the estimated camera
parameters (including available position, rotation and focal
length parameters 45) and planar position are used to
convert each pixel gamma into an estimated pixel depth
from the camera, which then becomes the per-frame depth
map 55. The camera parameters are preferably subjected to
a curve fitting process 88 prior to application in the
gamma-to-depth calculation.

An extension to this technique generates a per-frame 
geometric wireframe mesh of the estimated surfaces of 
objects in that image. This surface mesh 56 is generated 
from the depth map 55 and estimated camera parameters 45. 
Various combinations of the original images 50, estimated
camera parameters 45, depth maps 55 and surface meshes 56 
can be used to produce the generated image sequence 60 that 
incorporates the three-dimensional information from this 
scene analysis 42 technique. 

Temporal smoothing according to the invention is thus
applicable to multiple stages of the analysis process 42. For
example, the per-pixel gammas are smoothed by the
smoothing filter 90 before they are converted to depth data.
Since depth estimates are inversely proportional to gamma,
relatively small frame-to-frame variations in gammas near
zero can generate large temporal spikes in the associated
depth data. Such gamma variations come from estimation
errors or artifacts in the original images (such as specular
highlights). By using the smoothing filter 90 to smooth the
gammas in this manner, temporal sparkles and ripples are
also reduced.
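Smoothing the per-pixel gammas before the gamma-to-depth conversion can be sketched with a box filter, the simplest of the convolution filters mentioned earlier (the example reported for FIGS. 8-1 through 8-3 used a 5x5 "box car"). The sketch assumes one gamma map per frame stored as a list of lists, positive gammas, a depth conversion of the hypothetical form depth = k / gamma with a single constant `k`, and an `eps` guard against division by zero. These names and the conversion are illustrative assumptions, not the exact calculation 92.

```python
def box_filter(img, radius):
    """Mean over a (2*radius+1)^2 window, shrinking the window at borders."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [img[j][i] for j in ys for i in xs]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def gamma_to_depth(gamma_map, k=1.0, eps=1e-6):
    """Hypothetical inverse relation depth = k / gamma. Assumes positive
    gammas; max() clamps tiny or negative values up to eps."""
    return [[k / max(g, eps) for g in row] for row in gamma_map]


# A flat gamma map of 0.05 with one near-zero estimate (e.g. caused by a
# specular highlight) in the centre.
gammas = [[0.05] * 5 for _ in range(5)]
gammas[2][2] = 0.005

raw_depth = gamma_to_depth(gammas)
smoothed_depth = gamma_to_depth(box_filter(gammas, radius=1))
print(round(raw_depth[2][2]), round(smoothed_depth[2][2]))  # 200 22
```

The flat background maps to a depth of 20, so smoothing the gammas first shrinks the isolated spike from 200 to about 22 rather than letting it pass through the inverse conversion unchecked.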

Camera path and camera rotation estimates are also
temporally smoothed from image to image with the
curve-fitting process 88. Smoothing camera paths and camera
rotations can be done at the end of the analysis process 42,
or as an intermediate result for use in the gamma to depth 92
calculation. For a sequence of input images 50 captured with
a zooming camera, temporal smoothing through curve-fitting
can be applied to the resulting time-series of focal
length estimates. The sequence of estimated planar positions
can be similarly smoothed.

The per-pixel depth estimates in the depth map 55 can 
themselves be filtered for each image. This can be done with 
or without filtering the gammas. Smoothing of per-pixel 
depth estimates in each image is intended to reduce both 
temporal sparkles and temporal ripples in images generated 
from these depth map 55 estimates. 

Depth map 55 data is typically generated within a camera
coordinate system, where the per-pixel depth values estimate
the distance from the camera in either polar or
Cartesian coordinates. By using the estimated planar motion
for the estimated or identified plane in the scene, these depth
estimates 55 can also be converted using standard geometry
transformations into estimated heights from the plane. (Note
that these per-pixel depth estimates are self-consistent
within each image, but are determined only up to an arbitrary
scale factor relative to the depths in the actual scene.)
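Converting a camera-frame depth estimate into a height above the identified plane can be sketched with a pinhole back-projection followed by a point-to-plane distance. The intrinsics (`f`, `cx`, `cy`) and the plane, given as a normal `n` and offset `d` in camera coordinates, are hypothetical inputs, and depth is taken here as the Cartesian Z coordinate; the text itself only states that standard geometry transformations are used.

```python
import math


def backproject(u, v, depth, f, cx, cy):
    """Pinhole model: pixel (u, v) at Cartesian depth Z -> camera-space point."""
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)


def height_above_plane(point, n, d):
    """Signed distance of `point` from the plane n . X + d = 0."""
    norm = math.sqrt(sum(c * c for c in n))
    return (sum(p * c for p, c in zip(point, n)) + d) / norm


# Assumed setup: the camera looks along +Z with its Y axis pointing down,
# and the table plane lies 1.5 units below the camera, i.e. the plane
# Y = 1.5, written as n = (0, -1, 0), d = 1.5 so that "up" is positive.
f, cx, cy = 500.0, 320.0, 240.0
p = backproject(320.0, 390.0, depth=4.0, f=f, cx=cx, cy=cy)
print(p)  # (0.0, 1.2, 4.0)
print(round(height_above_plane(p, (0.0, -1.0, 0.0), 1.5), 6))  # 0.3
```

Because the depths share an unknown per-image scale, the heights computed this way inherit the same scale, which is why the aggregated heights are a convenient signal for detecting jumps in that scale.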

These estimated heights can then be aggregated, and their 
aggregate values used in a curve-fitting technique to achieve 
temporal smoothing of jumps in the scale factor. This 
temporal smoothing of the scale factor can then be applied
to the depth estimates in each image, for generating an 
image sequence from this sequence of depth estimates that 
has fewer discernible temporal jumps in the scaling. 

In the extension to this example process 42, depth map 55
data is converted into per-image geometric meshes 56.
These surface meshes 56 can also be smoothed so that fewer
temporal artifacts are visible when using a time-series of
these surface meshes to generate an image sequence.
Smoothing can be done during the conversion process by
fitting splines, such as Non-Uniform Rational B-Splines
(NURBS), to the estimated points in space rather than
connecting these points into a polygonal mesh. Smoothing
can also be done after conversion as a post-processing step.
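The effect of using a spline instead of joining estimated points with straight segments can be shown on a single row of a mesh. The sketch below is a simplification: it treats the estimated points directly as control points of a uniform cubic B-spline (a special case of NURBS with uniform knots and unit weights) instead of performing a true least-squares spline fit, which is enough to show how the spline rounds off a spurious kink. `cubic_bspline` is a hypothetical name.

```python
def cubic_bspline(points, samples_per_segment=4):
    """Evaluate a uniform cubic B-spline whose control points are `points`
    (tuples of equal dimension). The curve approximates rather than
    interpolates the points, which is what softens isolated kinks."""
    out = []
    for i in range(len(points) - 3):
        p0, p1, p2, p3 = points[i:i + 4]
        for s in range(samples_per_segment):
            t = s / samples_per_segment
            # Uniform cubic B-spline basis functions (they sum to 1).
            b0 = (1 - t) ** 3 / 6
            b1 = (3 * t**3 - 6 * t**2 + 4) / 6
            b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
            b3 = t**3 / 6
            out.append(tuple(b0 * a + b1 * b + b2 * c + b3 * d
                             for a, b, c, d in zip(p0, p1, p2, p3)))
    return out


# A row of estimated surface heights with one spurious spike at x = 3.
row = [(float(x), 0.0) for x in range(7)]
row[3] = (3.0, 1.2)
curve = cubic_bspline(row)
print(round(max(y for _, y in curve), 3))  # 0.8, versus the 1.2 spike
```

A polygonal mesh through the same points would reproduce the full 1.2 spike with sharp creases on either side; the spline both lowers the peak and removes the creases, which is the mesh-to-mesh discontinuity the text describes.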

An image sequence generated from a time-series of 
spline-based surface meshes 56 will generate fewer temporal 
anomalies than one based on a time-series of polygonal 
meshes. This is because polygonal meshes accentuate dis- 
continuities in estimated surface geometry from mesh to 
mesh, whereas spline-based meshes smooth out these
discontinuities.

In all cases in this example, different parts of the image (or
geometry) can be selected and treated differently in the 
temporal smoothing process. This allows different levels of 
smoothing to be applied to different parts of the intermediate 
or final results. The differentiation can be specified through 
a separate process, or as part of the interactive parameter 46 
and markup 48 processes that allow the user to control the 
inputs to scene analysis and temporal smoothing techniques 
within an interactive feedback loop. User markup data 49
may thus be used to specify areas of an input image 
sequence 50 to which different smoothing filters 90 or 
different curve fitting techniques 88 may be applied. 
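Applying different levels of smoothing to different regions, as selected by pixel masks 96 or user markup data 49, can be sketched by choosing a filter radius per mask label. In this illustration the mask is a per-pixel integer label, the mapping from label to box-filter radius is a hypothetical user setting (radius 0 leaves a region unsmoothed, which also covers areas removed from the smoothing), and each window averages only pixels that share the centre pixel's label, so smoothing does not bleed across region boundaries.

```python
def masked_box_filter(img, labels, radius_for_label):
    """Box-filter each pixel with a radius chosen by its mask label.

    labels[y][x] is an integer region id; radius_for_label maps a region
    id to a filter radius (0, or a missing id, means leave unsmoothed).
    Only neighbours carrying the same label contribute to the mean.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            label = labels[y][x]
            r = radius_for_label.get(label, 0)
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            vals = [img[j][i] for j in ys for i in xs if labels[j][i] == label]
            row.append(sum(vals) / len(vals))  # never empty: includes (x, y)
        out.append(row)
    return out


# Region 1 (left half) is smoothed with a 3x3 window; region 2 (right
# half), e.g. an area the user wants kept sharp, is left untouched.
depth = [[0.0, 0.0, 9.0, 0.0],
         [0.0, 9.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 9.0]]
labels = [[1, 1, 2, 2] for _ in range(3)]
smoothed = masked_box_filter(depth, labels, {1: 1, 2: 0})
print(smoothed[1])  # [1.5, 1.5, 0.0, 0.0]
```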

For example, the smoothing filter 90 may be relatively
sharp if the particular concern is to correctly model areas of
objects which are occluded. For other areas of the images 50,
such as curved objects like the soda can on the table in
FIG. 4, the smoothing filter 90 parameters may be relaxed.
Still other areas of an input image sequence 50 may be less
important than others, and such areas can be removed from
the smoothing.

Where casting shadows properly is the primary concern,
sharper filters can be used than in areas where softer
relighting effects are desired.

An appreciation for the improvements which may be
accomplished by using the invention is evident from
comparing FIGS. 5, 6, 7-1, 7-2, 7-3, and 8-1, 8-2, 8-3. FIG. 5
is a plot of the "raw" gamma values determined from an
image pair selected from an input image sequence 50
captured from the kitchen shown in FIG. 4. FIG. 6 shows the
results of applying the smoothing filter 90.

FIGS. 7-1, 7-2, and 7-3 represent a sequence of raw height
estimates from the depth map 55 for the table and the object
on the table. Not only do the estimates appear jagged, but,
more importantly, the estimates developed for the middle
image of the sequence, shown in FIG. 7-2, are extremely
erroneous. An image sequence 60 generated from this depth
map is thus prone to exhibiting the aforementioned temporal
jumps, sparkles and other anomalies when viewed.

FIGS. 8-1, 8-2 and 8-3 are the height estimates of the table
and its objects after application of the smoothing filter 90,
exhibiting far greater continuity. The filter 90 applied in this
instance was a "box car" of 5x5 pixels applied to an image
sequence at a D1 resolution of 720x486 pixels. The middle
image of the sequence is now far more accurate, and the
sequence when viewed no longer exhibits the undesirable
anomalies.

The smoothed results across a range of images may also
be used as a constraining process. For example, the "best"
height estimate, that is, the height estimate for an image or
a portion of an image exhibiting the least variance across a
range of images, can be used to constrain subsequent
iterations of the scene analysis 42.

EQUIVALENTS

While this invention has been particularly shown and
described with references to preferred embodiments thereof,
it will be understood by those skilled in the art that various
changes in form and details may be made therein without
departing from the spirit and scope of the invention as
defined by the appended claims. Those skilled in the art will
recognize or be able to ascertain using no more than routine
experimentation, many equivalents to the specific
embodiments of the invention described specifically herein.
Such equivalents are intended to be encompassed in the
scope of the claims.

What is claimed is:

1. A method for processing a sequence of source images
of a visual scene comprising the steps of:
(a) analyzing the sequence of images to develop a data
model of selected characteristics of the scene;
(b) selecting at least two time-based series of data sets
from the data model of selected characteristics provided
by the scene analysis process;
(c) comparing corresponding parametric values of a given
time series with corresponding parametric values of
another time series of data sets; and
(d) applying a smoothing filter and scaling adjustments to
the parametric values to modify the data sets, to obtain
visually pleasing results in a design visualization
application.

2. A method as in claim 1 wherein the step of selecting a
time-based series of data sets selects data from the original
source image sequence.

3. A method as in claim 1 wherein the step of selecting a
time-based series of data sets selects data from intermediate
results of the scene analysis selected from the group
consisting of camera parameters, captured parametric data,
markup data related to the images, or adjustments to
parametric data.

4. A method as in claim 1 wherein the step of selecting a
time-based series of data sets selects data from results of the
scene analysis.

5. A method as in claim 4 wherein the results are a depth
map.

6. A method as in claim 4 wherein the results are a surface
mesh.

7. A method as in claim 1 wherein the step of selecting at
least two time-based series of data sets additionally
comprises calculating an optical flow field.

8. A method as in claim 7 additionally comprising
obtaining a parallax fit.

9. A method as in claim 7 additionally comprising
deriving a gamma coefficient.

10. A method as in claim 1 further comprising, prior to the
step of applying a smoothing filter, deriving a gamma
coefficient for each of a multiplicity of pixels, the gamma
coefficients being inversely proportional to the pixel depth.

11. A method as in claim 10 wherein the step of applying
a smoothing filter includes applying the smoothing filter to
the gamma coefficients.

12. A method as in claim … further comprising, after the
step of applying a smoothing filter, converting each gamma
coefficient to an estimated pixel depth.

13. A method as in claim 12 further comprising applying
the smoothing filter to the estimated pixel depth.

14. A method as in claim 10 further comprising converting
each gamma coefficient to an estimated pixel depth.

15. A method as in claim 14 wherein the step of applying
a smoothing filter includes applying the smoothing filter to
the estimated pixel depth.

16. A method for processing a sequence of source images
of a visual scene comprising the steps of:
(a) analyzing the sequence of images to develop a data
model of selected characteristics of the scene;
(b) selecting at least two time-based series of data sets
from the data model of selected characteristics provided
by the scene analysis process;
(c) comparing corresponding parametric values of a given
time series with corresponding parametric values of
another time series of data sets;
(d) deriving a gamma coefficient for each of a multiplicity
of pixels, the gamma coefficients being inversely
proportional to the pixel depth; and
(e) modifying the parametric values and the gamma
coefficients, to obtain visually pleasing results in a
design visualization application.

17. A method as in claim 16, wherein the step of modifying
includes applying a smoothing filter to the gamma
coefficients.

18. A method as in claim 17, further comprising converting
each gamma coefficient to an estimated pixel depth.

19. A method as in claim 18, further comprising applying
the smoothing filter to the estimated pixel depth.